Robust text line detection in historical documents: learning and evaluation methods
Authors
Abstract
Text line segmentation is one of the key steps in historical document understanding. It is challenging due to the variety of fonts, contents, writing styles and the quality of documents that have degraded through the years. In this paper, we address the limitations that currently prevent people from building models with a high generalization capacity. We present a study conducted using three state-of-the-art systems, Doc-UFCN, dhSegment and ARU-Net, and show that it is possible to build generic models trained on a wide variety of datasets that can correctly segment diverse unseen pages. This paper also highlights the importance of the annotations used during training: each existing dataset is annotated differently. We present a unification of the annotations and show its positive impact on the final text recognition results. To this end, we present a complete evaluation strategy using standard pixel-level metrics, object-level ones and introducing goal-oriented metrics.
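As a rough illustration of what a pixel-level metric measures, the sketch below computes Intersection-over-Union (IoU) between a predicted and a reference text-line mask. It assumes IoU is among the standard pixel-level metrics the abstract refers to; the function name `pixel_iou` and the toy masks are hypothetical and are not taken from the paper.

```python
def pixel_iou(pred, truth):
    """Intersection-over-Union between two binary masks given as nested lists."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += bool(p) and bool(t)   # pixel marked as text line in both masks
            union += bool(p) or bool(t)    # pixel marked as text line in at least one mask
    # Convention assumed here: two empty masks count as perfect agreement.
    return inter / union if union else 1.0


# Toy 4x4 page: the reference text line covers the whole second row.
truth = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The prediction misses the first pixel of the line and adds one stray pixel below it.
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(pixel_iou(pred, truth))  # 3 shared pixels out of 5 marked in total
```

A score of this kind only counts overlapping pixels. It does not say whether lines were merged or split, which is one reason the abstract pairs pixel-level metrics with object-level and goal-oriented ones.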
Similar resources
Text line detection in handwritten documents
Article history: Received 13 April 2007 Received in revised form 26 March 2008
Robust Line Detection in Historical Church Registers
To automatically acquire information recorded in church registers and other historical scriptures, the text of such documents must be segmented prior to automatic reading. Segmentation of old handwritten scriptures is difficult for two main reasons: lines of text are generally not straight, and the ascenders and descenders of adjacent lines interfere. The algorithms described in ...
A Two-Stage Method for Text Line Detection in Historical Documents
This work presents a two-stage text line detection method for historical documents. In the first stage, a deep neural network called ARU-Net assigns each pixel to one of three classes: baseline, separator or other. The separator class marks the beginning and end of each text line. The ARU-Net is trainable from scratch with manageably few manually annotated example images (fewer than 50). This...
Skew detection and text line position determination in digitized documents
This paper proposes a computationally efficient procedure for skew detection and text line position determination in digitized documents, which is based on the cross-correlation between the pixels of vertical lines in a document. The determination of the skew angle in documents is essential in optical character recognition systems. Due to the text skew, each horizontal text line intersects a p...
Text Extraction from Historical Handwritten Documents by Edge Detection
Many national archives and libraries keep large amounts of historical handwritten documents. One problem that many archivists face is the seeping of ink through the pages of certain double-sided handwritten documents after long periods of storage. The result is that the handwritten characters from the reverse side appear as noise on the front side and even interfere with the front side char...
Journal
Journal title: International Journal on Document Analysis and Recognition
Year: 2022
ISSN: 1433-2833, 1433-2825
DOI: https://doi.org/10.1007/s10032-022-00395-7